dpci: replace tasklet with softirq
The existing tasklet mechanism has a single global spinlock that is
taken every time the global list is touched, and we use this lock quite
a lot: do_tasklet_work (called via a softirq and from the idle loop)
takes it, as does any other operation on the tasklet_list.
The problem we are facing is that quite a lot of tasklets are
scheduled. The most commonly invoked one is the tasklet injecting
VIRQ_TIMER into the guest. Guests are not insane and do not set their
one-shot or periodic clocks to sub-1ms intervals (which would cause
said tasklet to be scheduled that often).
The problem appears when PCI passthrough devices are used across many
sockets and we have a mix of heavy-interrupt guests and idle guests.
The idle guests end up seeing 1/10 of their RUNNING timeslice eaten by
the hypervisor (and 40% steal time).
PCI interrupts are injected via hvm_do_IRQ_dpci, which schedules the
hvm_dirq_assist tasklet every time an interrupt is received. The call
chain is:
_asm_vmexit_handler
-> vmx_vmexit_handler
->vmx_do_extint
-> do_IRQ
-> __do_IRQ_guest
-> hvm_do_IRQ_dpci
tasklet_schedule(&dpci->dirq_tasklet);
[takes lock to put the tasklet on]
[later on the schedule_tail is invoked which is 'vmx_do_resume']
vmx_do_resume
-> vmx_asm_do_vmentry
-> call vmx_intr_assist
-> vmx_process_softirqs
-> do_softirq
[executes the tasklet function, takes the
lock again]
Meanwhile, other CPUs might be sitting in an idle loop and be invoked
to deliver a VIRQ_TIMER, which also ends up taking the lock twice:
first to schedule the v->arch.hvm_vcpu.assert_evtchn_irq_tasklet
(accounted to the guest's BLOCKED state), then to execute it - which is
accounted to the guest's RUNTIME state.
The end result is that on an 8-socket machine with PCI passthrough,
where four sockets are busy with interrupts and the other sockets have
idle guests, we end up with the idle guests having around 40% steal
time and 1/10 of their timeslice (3ms out of 30ms) tied up taking the
lock. The latency of the PCI interrupts delivered to the guest also
suffers.
With this patch the problem disappears completely: we remove the lock
from the PCI passthrough use-case (the 'hvm_dirq_assist' case) by not
using tasklets at all.
The patch is simple - instead of scheduling a tasklet we schedule our
own softirq - HVM_DPCI_SOFTIRQ, which will take care of running
'hvm_dirq_assist'. The information we need on each CPU is which
'struct hvm_pirq_dpci' structures 'hvm_dirq_assist' needs to run on.
That is simply solved by threading the 'struct hvm_pirq_dpci' through a
linked list. The rule of running only one 'hvm_dirq_assist' for only
one 'hvm_pirq_dpci' is also preserved by having 'schedule_dpci_for'
ignore any subsequent calls for an 'hvm_pirq_dpci' which has already
been scheduled.
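As a rough illustration, the schedule-once rule can be sketched as a
standalone C model. This is not the Xen code: 'schedule_dpci_for' and
STATE_SCHED follow the patch's naming, but the per-CPU list and the
raising of HVM_DPCI_SOFTIRQ are reduced to a single list and a comment.

```c
/* Standalone sketch of the schedule-once rule described above.
 * STATE_SCHED mirrors the patch's state bit; the per-CPU list and
 * raise_softirq() are simplified for illustration. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

enum { STATE_SCHED, STATE_RUN };

struct hvm_pirq_dpci {
    atomic_uint state;            /* STATE_SCHED / STATE_RUN bits */
    struct hvm_pirq_dpci *next;   /* threads the structure onto the list */
};

static struct hvm_pirq_dpci *dpci_list;   /* stands in for the per-CPU list */

/* Returns true if newly queued; false if a run is already pending. */
static bool schedule_dpci_for(struct hvm_pirq_dpci *pirq_dpci)
{
    /* Subsequent calls find STATE_SCHED already set and are ignored,
     * so at most one hvm_dirq_assist runs per hvm_pirq_dpci. */
    if (atomic_fetch_or(&pirq_dpci->state, 1u << STATE_SCHED)
        & (1u << STATE_SCHED))
        return false;

    pirq_dpci->next = dpci_list;
    dpci_list = pirq_dpci;
    /* The real code would raise_softirq(HVM_DPCI_SOFTIRQ) here. */
    return true;
}
```

The atomic fetch-or is what makes duplicate scheduling from concurrent
interrupts harmless: only the first caller queues the entry.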
== Code details ==
Most of the code complexity comes from the '->dom' field in the
'hvm_pirq_dpci' structure. We use it for ref-counting, and as such it
MUST be valid as long as the STATE_SCHED bit is set. Whoever clears
the STATE_SCHED bit does the ref-counting and may also reset the
'->dom' field.
To compound the complexity, there are multiple points where the
'hvm_pirq_dpci' structure is reset or re-used. Initially (the first
time the domain uses the pirq), the 'hvm_pirq_dpci->dom' field is set
to NULL as it is allocated. On subsequent calls into
'pt_irq_create_bind' the '->dom' is whatever it was last time.
As this is the initial call (which QEMU ends up making when the guest
writes a vector value into the MSI field), we MUST set the '->dom' to
the proper structure (otherwise we cannot do proper ref-counting).
The mechanism to tear it down is more complex, as there are three ways
it can be executed. To make it simpler, everything revolves around
'pt_pirq_softirq_active'. If it returns -EAGAIN, that means there is
an outstanding softirq that needs to finish running before we can
continue the teardown. With that in mind:
a) pci_clean_dpci_irq. This gets called when the guest is being
destroyed. We end up calling 'pt_pirq_softirq_active' to see if it
is OK to continue the destruction.
The scenarios in which the 'struct pirq' (and subsequently the
'hvm_pirq_dpci') gets destroyed are when:
- the guest did not use the pirq at all after setup;
- the guest did use the pirq, but masked it and left it in that state;
- the guest did use the pirq, but crashed.
In all of those scenarios we end up calling 'pt_pirq_softirq_active'
to check whether the softirq is still active. Read below on the
'pt_pirq_softirq_active' loop.
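The teardown gate can be sketched as a minimal standalone model (the
names mirror the patch, but the bodies are illustrative only - the real
pci_clean_dpci_irq does the actual unbinding and freeing):

```c
/* Minimal model of the teardown gate: an hvm_pirq_dpci counts as
 * active while either state bit is set, and cleanup paths such as
 * pci_clean_dpci_irq back off with -EAGAIN until it quiesces. */
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

enum { STATE_SCHED, STATE_RUN };

struct hvm_pirq_dpci {
    atomic_uint state;   /* STATE_SCHED / STATE_RUN bits */
};

static bool pt_pirq_softirq_active(struct hvm_pirq_dpci *pirq_dpci)
{
    return atomic_load(&pirq_dpci->state) != 0;
}

static int pci_clean_dpci_irq(struct hvm_pirq_dpci *pirq_dpci)
{
    if (pt_pirq_softirq_active(pirq_dpci))
        return -EAGAIN;    /* outstanding softirq: retry the teardown */

    /* ...safe to tear down / free the structure here... */
    return 0;
}
```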
b) pt_irq_destroy_bind (guest disables the MSI). We double-check that
the softirq has run by piggy-backing on the existing
'pirq_cleanup_check' mechanism which calls 'pt_pirq_cleanup_check'.
We add the extra call to 'pt_pirq_softirq_active' in
'pt_pirq_cleanup_check'.
NOTE: Guests that use event channels first unbind the event channel
from the PIRQ, so 'pt_pirq_cleanup_check' won't be called as 'event'
is set to zero. In that case we clean it up via either the a) or the
c) mechanism.
There is an extra scenario, regardless of 'event' being set or not:
the guest did 'pt_irq_destroy_bind' while an interrupt was triggered
and a softirq was scheduled (but had not yet run). It is OK to still
run the softirq, as hvm_dirq_assist won't do anything (the flags are
set to zero). However, we will try to deschedule the softirq if we
can (by clearing the STATE_SCHED bit and doing the ref-counting
ourselves).
c) pt_irq_create_bind (not a typo). The scenarios are:
- the guest disables the MSI and then enables it (rmmod and modprobe
in a loop). We call 'pt_pirq_reset', which checks whether the
softirq has been scheduled. Imagine b) with interrupts in flight
and c) getting called in a loop.
We busy-wait on 'pt_pirq_softirq_active' (at the start of
'pt_irq_create_bind') with the event_lock spinlock dropped, waiting
(cpu_relax). We cannot call 'process_pending_softirqs' as it might
result in a deadlock. hvm_dirq_assist will be executed and then the
softirq will clear 'state', which signals that we can re-use the
'hvm_pirq_dpci' structure. In case the softirq is scheduled on a
remote CPU it will run there, as the semantics of a softirq are that
it executes before that CPU resumes the guest.
- we hit one of the error paths in 'pt_irq_create_bind' while an
interrupt was triggered and a softirq was scheduled.
If the softirq is in STATE_RUN, that means it is executing and we
should let it continue on. We can clear the '->dom' field, as the
softirq has stashed it beforehand. If the softirq is in STATE_SCHED
and we are successful in clearing the bit, we do the ref-counting and
clear the '->dom' field. Otherwise we let the softirq continue on and
the '->dom' field is left intact; the clearing of '->dom' is then left
to case a), b), or again c).
Note that in both cases the 'flags' variable is cleared, so
hvm_dirq_assist won't actually do anything.
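The deschedule-or-let-run decision above can be sketched as a single
compare-and-swap. This is again a standalone model, not the patch's
code: 'pt_pirq_softirq_cancel' and 'put_domain' are hypothetical
stand-ins for the clearing of STATE_SCHED and the ref-count drop.

```c
/* Standalone sketch of the rule above: only the path that clears
 * STATE_SCHED drops the domain reference and may reset ->dom; a
 * softirq already in STATE_RUN is left to finish on its own. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

enum { STATE_SCHED, STATE_RUN };

struct domain { atomic_int refcnt; };

struct hvm_pirq_dpci {
    atomic_uint state;
    struct domain *dom;   /* valid for as long as STATE_SCHED is set */
};

static void put_domain(struct domain *d)
{
    atomic_fetch_sub(&d->refcnt, 1);
}

/* Returns true if we descheduled the softirq (and did the put). */
static bool pt_pirq_softirq_cancel(struct hvm_pirq_dpci *pirq_dpci)
{
    unsigned int expect = 1u << STATE_SCHED;

    /* Succeeds only when the softirq is scheduled but not yet
     * running; a STATE_RUN softirq keeps its stashed domain. */
    if (!atomic_compare_exchange_strong(&pirq_dpci->state, &expect, 0))
        return false;

    put_domain(pirq_dpci->dom);
    pirq_dpci->dom = NULL;
    return true;
}
```

The compare-and-swap makes the "whoever clears STATE_SCHED does the
ref-counting" rule race-free: exactly one party can win the exchange.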
Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>